Relational Learning: A Web-Page Classification Viewpoint
Abstract
This paper organises some general observations on Relational Learning which arose from research into classifying Web pages. The motivation for this piece is to contribute towards developing a broad overview of the field, so as to understand which aspects of Relational Learning are common to all domains, and which aspects are peculiar to specific domains. Hence the views presented here are necessarily only one piece of the puzzle, and it is hoped that analogous perspectives from other practitioners will improve upon the picture being developed here. With that in mind, let's take a whirlwind tour of Relational Learning as seen through the eyes of someone interested in classifying Web pages.

1 Relational Learning Problems

Relational learning problems, by definition, require data in some relational format. However, there is still some variety in the nature of the classification problems that are considered. Being aware of these variations can be useful when it comes to choosing a particular learning algorithm for the task at hand. The following general categories of relational learning problem are encountered when considering machine learning on the Web. Do problems from other domains fit into this categorisation? Are there more categories?

1.1 Classifying Graph Instances

Problems like classifying Web sites fall into this category. Instances are graphs, usually not connected to each other, and the task is to classify the full relational description of an instance.

1.2 Classifying Node Instances

Problems like classifying Web pages fall into this category. Instances are nodes, and paths may or may not exist between them. The task is to classify an instance using both the properties of that instance and properties of the "neighbourhood" of that instance. Since nodes can appear in the neighbourhood of more than one instance, thorny independence issues can arise when trying to evaluate whether such shared neighbours are useful predictors or not.
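
The node-classification setting can be made concrete with a minimal sketch. Everything in it is an assumption made for illustration: the toy link graph, the page names, the word sets, and the choice of a one-hyperlink neighbourhood are hypothetical and are not taken from the paper. It shows how an instance description might combine a page's own properties with those of its neighbours, and how counting shared neighbours exposes the independence issue.

```python
from collections import Counter

# Hypothetical toy Web graph: page -> pages it links to.
LINKS = {
    "course_a": ["dept_home", "prof_x"],
    "course_b": ["dept_home", "prof_y"],
    "prof_x": ["dept_home"],
    "prof_y": ["dept_home", "course_b"],
    "dept_home": [],
}

# Hypothetical per-page property: the set of words on each page.
WORDS = {
    "course_a": {"syllabus", "lecture"},
    "course_b": {"syllabus", "exam"},
    "prof_x": {"research", "publications"},
    "prof_y": {"research", "teaching"},
    "dept_home": {"department", "faculty"},
}


def neighbourhood(page, links):
    """Pages one hyperlink away from `page`, in either direction."""
    out_links = set(links.get(page, []))
    in_links = {src for src, dsts in links.items() if page in dsts}
    return (out_links | in_links) - {page}


def describe(page, links, words):
    """Combine a page's own words with the words of its neighbours."""
    neighbour_words = set()
    for other in neighbourhood(page, links):
        neighbour_words |= words.get(other, set())
    return {"own": words.get(page, set()), "neighbour": neighbour_words}


def shared_neighbour_counts(instances, links):
    """Count, for each page, how many instances have it as a neighbour."""
    counts = Counter()
    for page in instances:
        counts.update(neighbourhood(page, links))
    return counts


if __name__ == "__main__":
    instances = ["course_a", "course_b"]
    for page in instances:
        print(page, describe(page, LINKS, WORDS))
    # A count above 1 marks a neighbour shared by several instances.
    print(shared_neighbour_counts(instances, LINKS))
```

In this toy graph `dept_home` lies in the neighbourhood of both course pages, so any feature drawn from it is not an independent observation for the two instances.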

1.3 Classifying Node-Tuple Instances

Problems like classifying pairs of Web pages (e.g. is this course being taught by this professor?) fall into this category. Instances are tuples of nodes, normally connected in some way. The task is to classify an instance using information about how the nodes relate to each other.

For the moment, I've not broken out edge classification into a separate section. Edge classification on the Web seems to be covered well enough by classifying the pair of Web pages linked by the edge as a tuple. Are there compelling examples from other domains that warrant edge classification being considered independently?

2 Relational Learning Features

Being aware of what kinds of regularity your particular relational learning problem may exhibit is important when designing and evaluating algorithms for that problem. This section details some classes of predictive regularity found in Web page classification problems. Are other forms of regularity found in other relational learning domains, or by algorithms different to the one used in my work (FOIL)?
Similar resources
Web Page Classification Using Relational Learning Algorithm and Unlabeled Data
Applying relational tri-training (R-tri-training for short) to web page classification is investigated in this paper. R-tri-training, as a new relational semi-supervised learning algorithm, is well suited to learning in web page classification. The semi-supervised component of R-tri-training allows it to exploit unlabeled web pages to enhance the learning performance effectively. In addition,...
Collective Classification with Relational Dependency Networks
Collective classification models exploit the dependencies in a network of objects to improve predictions. For example, in a network of web pages, the topic of a page may depend on the topics of hyperlinked pages. A relational model capable of expressing and reasoning with such dependencies should achieve superior performance to relational models that ignore such dependencies. In this paper, we ...
Web mining with relational clustering
Clustering is an unsupervised learning method that determines partitions and (possibly) prototypes from pattern sets. Sets of numerical patterns can be clustered by alternating optimization (AO) of clustering objective functions or by alternating cluster estimation (ACE). Sets of non-numerical patterns can often be represented numerically by (pairwise) relations. These relational data sets can ...
User Evaluation of a System for Classifying and Displaying Political Viewpoints of Weblogs
This paper presents a Web-based user evaluation of a system for classifying and presenting political viewpoints of blog posts. The system is based on a classification model trained using a supervised learning algorithm, and the data set consists of recent posts from blogs that are self-identified as a liberal or a conservative viewpoint. We first discuss the classification process. Then, with a...
Relational Learning
Most of the content-based approaches to text and web document classification explored in other related projects are based on the bag of words model, well known from the area of Information Retrieval. This model is simple and efficient, but fails to capture many additional document features such as the internal HTML structure, language structure and inter-document link structure. All this howeve...
Publication date: 2003